Evaluation of measures of sustainability and sustainability determinants for use in community, public health, and clinical settings: a systematic review

Background: Sustainability is concerned with the long-term delivery and subsequent benefits of evidence-based interventions. To further this field, we require a strong understanding and thus measurement of sustainability and what impacts sustainability (i.e., sustainability determinants). This systematic review aimed to evaluate the quality and empirical application of measures of sustainability and sustainability determinants for use in clinical, public health, and community settings.
Methods: Seven electronic databases, reference lists of relevant reviews, online repositories of implementation measures, and the grey literature were searched. Publications were included if they reported on the development, psychometric evaluation, or empirical use of a multi-item, quantitative measure of sustainability, or sustainability determinants. Eligibility was not restricted by language or date. Eligibility screening and data extraction were conducted independently by two members of the research team. Content coverage of each measure was assessed by mapping measure items to relevant constructs of sustainability and sustainability determinants. The pragmatic and psychometric properties of included measures were assessed using the Psychometric and Pragmatic Evidence Rating Scale (PAPERS). The empirical use of each measure was descriptively analyzed.
Results: A total of 32,782 articles were screened from the database search, of which 37 were eligible. An additional 186 publications were identified from the grey literature search. The 223 included articles represented 28 individual measures, of which two assessed sustainability as an outcome, 25 covered sustainability determinants and one explicitly assessed both. The psychometric and pragmatic quality was variable, with PAPERS scores ranging from 14 to 35, out of a possible 56 points. The Provider Report of Sustainment Scale had the highest PAPERS score and measured sustainability as an outcome. The School-wide Universal Behaviour Sustainability Index-School Teams had the highest PAPERS score (score=29) of the measures of sustainability determinants.
Conclusions: This review can be used to guide selection of the most psychometrically robust, pragmatic, and relevant measure of sustainability and sustainability determinants. It also highlights that future research is needed to improve the psychometric and pragmatic quality of current measures in this field.
Trial registration: This review was prospectively registered with Research Registry (reviewregistry1097), March 2021.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13012-022-01252-1.


Introduction
Maintaining the delivery and health impact of evidence-based interventions (EBIs) over time is a challenge across a range of community, public health, and clinical settings [1][2][3]. A 2020 systematic review of 18 multi-component school-based public health interventions found that none of the interventions continued to be delivered in their entirety (i.e., all components) once active implementation support (i.e., provision of start-up funding and/or other resources) ceased [4]. Similarly, a recent systematic review found that only seven of 18 evaluations sustained clinical practice guidelines in a variety of healthcare settings following active implementation [5]. Understanding why EBI implementation attenuates over time, and how best to support long-term delivery, is necessary to ensure that implementation investments are worthwhile. This concept, referred to as "sustainability," is an important outcome in implementation science [6].
Similar to other emerging fields, the definitions relating to concepts of sustainability have been varied and at times conflicting [7], emphasising the need for a consistent nomenclature in this field. More recently, however, a recommended definition of sustainability has been recognised as "the continued delivery of an innovation or intervention, potentially after adaptation, at a sufficient level to ensure the continued health impact and benefits of the intervention" [7]. Sustainability determinants, in turn, are defined as "the characteristics or factors associated with the continued use and impact of an EBI" [8][9][10]. Several frameworks recognise and conceptualise the complex and dynamic nature of sustainability [2,11,12,13]. The Integrated Sustainability Framework developed by Shelton and colleagues (2018) [2] outlines recommendations on how sustainability should be conceptualised and measured. It also organises influential multi-level factors (i.e., determinants) into five domains (i.e., outer context, inner context, intervention characteristics, processes, and implementer and population characteristics) [2,14].
Central to any field is measurement validity, or the ability to accurately measure relevant concepts, outcomes, and constructs. To do this, a measure should comprehensively and adequately cover the intended construct. This is known as content validity [15] and is recognised as one of the most important measurement properties [16]. For measures of sustainability as an outcome to have adequate content validity, they should encompass the features of a multi-component definition, such as that proposed by Moore et al., and reflect concepts of time, continued delivery of the EBI, maintained behaviour change, evolution and/or adaptation of the program, and continued health and other benefits [7]. Measures should also demonstrate reliability and evidence of other domains of validity (e.g., concurrent validity), to ensure accuracy and reduce error. Finally, measures should exhibit important pragmatic qualities, including easy access, use, scoring, and interpretation [17]. Pragmatic qualities are less frequently evaluated but are essential in ensuring the uptake of reliable and valid measures.
Identifying and measuring sustainability, as well as factors related to sustainability (i.e., determinants), is complex given the diverse and dynamic settings being studied. Consequently, many existing measures have only been used once [18], illustrating limited standardisation in measurement. This makes it difficult to compare and synthesise findings across studies. Furthermore, there has been a lack of distinction between measures of sustainability determinants and sustainability as an outcome [2,9].
High-quality systematic reviews on available measures, their psychometric and pragmatic properties, and how they have been empirically used are essential for providing evidence-based recommendations on which measures to use, identifying gaps in measurement and highlighting areas for future research [19]. There are two systematic reviews exploring measures of sustainability as an implementation outcome in health care settings focused on mental health and substance use [18,20]. Overall, psychometric assessment reporting was poor, with only one psychometric indicator, norms, reported in more than half of the identified sustainability measures. They also found that most (54%) measures were used only a single time. While these two reviews provide a thorough evaluation of sustainability measures, they were limited by a narrow focus on behavioral health settings and a subset of psychometric and pragmatic properties. A third review, by Moullin et al. [8], used snowball sampling to identify sustainment and sustainability measures across a broader range of community, public health, and clinical settings, offering general guidance about how and in what circumstances each measure could be used, but no formal assessment of their quality was undertaken.
Collectively, these three reviews offer an excellent foundation for informing a comprehensive systematic review and critical assessment of both the psychometric and pragmatic qualities of measures of sustainability (as an outcome) and sustainability determinants, across a range of settings. This review addresses important gaps by allowing researchers to identify where robust and suitable measures exist, to reduce unnecessary duplication, and provide practical guidance to end-users in selecting the most relevant measure for their setting.
Specifically, we aimed to: (1) assess content validity by mapping the constructs covered by identified measures of (a) sustainability as an outcome to the multidimensional definition of sustainability proposed by Moore et al. [7], and (b) sustainability determinants to the domains and constructs outlined by the Integrated Sustainability Framework [2]; (2) assess the psychometric and pragmatic qualities of identified measures using a standardised assessment tool; and (3) describe how each of the identified measures has been applied in empirical research.

Methods
This systematic review is reported according to the Preferred Reporting Items for Systematic review and Meta-Analysis Protocols checklist (PRISMA) [21] (see Additional file 1) and followed established procedures used by other systematic reviews of measures of implementation outcomes [18,20,22,23]. It was registered prospectively with Research Registry (reviewregistry1097) prior to the final database search being conducted.

Search strategy
An extensive search strategy, informed by previous reviews of implementation measures [18,24,25,26,27] and reviews on sustainability determinants [14], was employed to identify eligible measures of sustainability and sustainability determinants. We searched the following electronic databases on 6 June 2021: the Cochrane Central Register of Controlled Trials (CENTRAL), MEDLINE, EMBASE, PsycINFO, ERIC, CINAHL, and SCOPUS. The search included keywords relevant to the three levels of search terms: (i) terms relevant or synonymous with the constructs of interest, sustainability, and sustainability determinants (e.g., sustain*, implement*); (ii) psychometric properties (e.g., psychometric*, reliab*); and (iii) setting (e.g., public health, evidence-based medicine). Please see Additional file 2: Table S1a to S1G for an example of the search strategy. Similar to previous reviews, we defined a measure as a multi-item survey, questionnaire, instrument, tool, or scale [24] that is quantitatively scored. Reference lists of previous relevant reviews were also searched. New measures published outside our search date and identified through journal alerts and snowball searching were also included. For aims 1 and 2, only full-text articles were eligible for inclusion. The authors of conference abstracts were contacted to obtain full-text articles. Online repositories of implementation measures, including the "Society for Implementation Research Collaboration Instrument Repository" [28] and the "Dissemination & Implementation Models in Health Research and Practice" [29] web tool, were also searched. Finally, a forward literature search was undertaken for each relevant measure, whereby two researchers independently searched the name of identified measures within Google Scholar. The first 100 hits were checked for relevance or until relevant articles were no longer being identified.
A citation search of the original development paper for each measure was conducted to identify empirical studies that used each measure. For measures that did not have a specified name, only the citation search was conducted. These searches were conducted independently by pairs of researchers (either BM, AH, CG, SH, or KA) between April 2021 and May 2022. For the third aim, published scientific manuscripts, reports, abstracts, trial registrations, and protocol papers describing the empirical use of eligible measures were included.

Inclusion/exclusion criteria
Publications were included if they reported on the development, psychometric evaluation, or empirical use of a multi-item, quantitative, self-report measure that is scored, of sustainability as an outcome or sustainability determinants, designed to be used in a community, public health, or clinical setting. Individual measures were the unit of interest as the development and psychometric evaluation of measures are usually reported across multiple publications. Empirical studies that applied the identified measures were included, to allow for an evaluation of how identified measures have been used in the field. Only measures that assumed a reflective measurement model of sustainability or sustainability determinants were included (i.e., consist of items that sought to reflect the underlying construct of sustainability or sustainability determinants and did not alter or define the construct such as an index) [26]. Publications of any language were included, and wherever possible, non-English publications were translated via colleagues or contacts proficient in the language of interest or Google translate. No restrictions were made on health condition or the target population. Published or unpublished full-text articles or papers were eligible. We excluded measures that were based on a formative measurement model (i.e., items define the underlying construct such as an index), as such measures were not relevant to the constructs we were assessing, and different properties are used to assess their rigor. Unscored checklists and single item tools were excluded, as these serve a different purpose than measures designed to quantify an underlying construct. Measures designed explicitly for a specific study and not for wider use in the field (i.e., one-time use measures) were excluded, as were qualitative measures.

Study selection
The search results from the electronic databases were managed and duplicates identified using EndNote version X9.2 software (Thomson Reuters, PA, USA). The de-duplicated library was imported into Covidence [30], where article screening occurred. Both title and abstract and full-text screening were conducted independently by two members of the research team (either AS, BM, AH, NN, NI, NM, or KA). Conflicts were resolved by a third member of the research team (AH or AS).

Critical assessment
The pragmatic and psychometric evidence of each eligible measure was assessed and scored using the Psychometric and Pragmatic Evidence Rating Scale (PAPERS) [17,31]. PAPERS includes 14 items that assess nine psychometric properties and five pragmatic features (see Table 1). Each item is scored using a six-point Likert scale ranging from −1 (poor) to 4 (excellent) [17,31]. The PAPERS criteria were applied to each individual measure, rather than an individual study or publication, as multiple publications often report on different aspects of a measure's pragmatic and psychometric properties. For measures that had multiple reports of the same pragmatic or psychometric property, for instance where multiple studies assessed responsiveness, the median score was used. If the median value was a non-integer, the score was rounded down [18,23,27]. Data were only assessed against the PAPERS psychometric criteria if they were being explicitly used to evaluate the psychometric properties of that measure. Due to the typically poor reporting of pragmatic indicators of a measure, grey literature, such as scoring manuals, was reviewed to assess such qualities. The quality of empirical studies was not assessed, as we were only interested in describing the application and use of eligible measures, aspects which are not influenced by the rigour of the research design or potential bias.
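The aggregation rule described above (median across reports, rounded down when not a whole number) can be illustrated with a minimal sketch. The function name `aggregate_rating` and the example ratings are hypothetical and are not taken from the review's own analysis code; "rounded down" is interpreted here as the mathematical floor, which is an assumption for negative medians.

```python
import math
from statistics import median

# Bounds of the six-point PAPERS item scale as described in the text.
MIN_RATING, MAX_RATING = -1, 4


def aggregate_rating(ratings):
    """Collapse several ratings of one property into a single item score.

    Uses the median of the reported ratings and rounds a non-integer
    median down (floor). Returns None when no ratings were reported.
    """
    if not ratings:
        return None
    for r in ratings:
        if not MIN_RATING <= r <= MAX_RATING:
            raise ValueError(f"rating {r} is outside the PAPERS range")
    return math.floor(median(ratings))


# Hypothetical example: two studies rated the same property 2 and 3.
# The median is 2.5, so the item score becomes 2.
score = aggregate_rating([2, 3])
```

Using the floor rather than conventional rounding gives the more conservative score whenever reports disagree.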

Data extraction
Data were extracted independently by two trained members of the research team (either NN, ED, AH, or AS), using a pre-piloted data extraction tool developed specifically for this study (Additional file 3). The data extraction form was programmed using REDCap, an electronic data capture tool hosted on the Hunter New England Population Health server [74,75]. An overview of the main fields programmed in the data extraction tool is shown in Additional file 3.
To assess content coverage of the included measures, the items from each measure were mapped to constructs important to sustainability and sustainability determinants. For measures of sustainability (as an outcome), items were mapped to the five constructs outlined by Moore et al.'s [7] comprehensive definition of sustainability (see the "Introduction" section). Items from measures of sustainability determinants were first mapped to lower-level constructs that define five higher-level domains proposed by the Integrated Sustainability Framework (i.e., outer context, inner context, intervention characteristics, processes, and implementer and population characteristics) [2] (see [14] for a more detailed description of the Integrated Sustainability Framework domains and constructs). Item mapping followed similar procedures undertaken in previous reviews [23,76], whereby two research team members proficient in the content area of sustainability (AH & AS) independently extracted and mapped the items from each measure to the domains of the relevant frameworks outlined above. We classified a measure as incorporating components of a specific construct if at least one item was mapped to that construct. Discrepancies were resolved through discussion and input by two review members. We classified each measure as assessing either sustainability (as an outcome) or sustainability determinants based on the content of their items and which definition (see above) the items predominantly aligned with.
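The "at least one item" coverage rule can be sketched as follows. This is an illustration only: the item wording and the item-to-domain assignments are hypothetical, and the sketch maps items directly to the five higher-level domains rather than to the lower-level constructs used in the review.

```python
# The five higher-level domains of the Integrated Sustainability Framework,
# as listed in the text.
ISF_DOMAINS = (
    "outer context",
    "inner context",
    "intervention characteristics",
    "processes",
    "implementer and population characteristics",
)


def domain_coverage(item_domains):
    """Return {domain: bool} for one measure.

    item_domains maps each item's text to the list of domains the reviewers
    judged it to reflect (possibly empty). A domain counts as covered if at
    least one item was mapped to it.
    """
    covered = set()
    for domains in item_domains.values():
        covered.update(domains)
    unknown = covered - set(ISF_DOMAINS)
    if unknown:
        raise ValueError(f"unrecognised domains: {sorted(unknown)}")
    return {domain: domain in covered for domain in ISF_DOMAINS}


# Hypothetical measure with four items (not from any included measure).
example_measure = {
    "Leadership actively supports the programme.": ["inner context"],
    "Funding for the programme is stable.": ["outer context", "inner context"],
    "Staff have the skills to deliver the programme.": [
        "implementer and population characteristics"
    ],
    "This item did not map to any domain.": [],
}

coverage = domain_coverage(example_measure)
n_domains_covered = sum(coverage.values())
```

Because a single mapped item is enough to count a domain as covered, this rule describes breadth of coverage and says nothing about how thoroughly each domain is assessed.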

Synthesis methods
Data were cleaned and summarised using SAS version 9.3. The constructs covered by each of the measures according to Moore et al.'s [7] definition of sustainability for measures assessing sustainability as an outcome, and the five higher-level domains from the Integrated Sustainability Framework [2] for measures of sustainability determinants, were summarised and organised in a table.
Descriptive statistics were used to summarise the quality of each measure against the proposed nine psychometric indicators and five pragmatic domains outlined by PAPERS [17]. Where possible, a total quality rating score was calculated for each measure by summing the relevant items, separately for the psychometric and pragmatic domains and overall. Total overall scores range from a possible −14 to 56 [17,31]. Summary tables were produced that included information describing the characteristics of the measure, the specific setting, and any sub-groups in which the measure has evidence of validity. The use of each measure in empirical studies was summarised descriptively.
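The stated score ranges follow directly from the number of items and the item scale, as the short sketch below shows. The helper names and the item ratings are hypothetical illustrations, not the review's SAS code, and the sketch assumes every item has already been given a rating.

```python
# Item counts and scale bounds as described for PAPERS in the text.
N_PSYCHOMETRIC_ITEMS = 9
N_PRAGMATIC_ITEMS = 5
MIN_RATING, MAX_RATING = -1, 4


def possible_range(n_items):
    """Lowest and highest possible sum for n_items rated -1 to 4."""
    return n_items * MIN_RATING, n_items * MAX_RATING


def papers_totals(psychometric, pragmatic):
    """Sum item ratings into psychometric, pragmatic, and overall totals."""
    if len(psychometric) != N_PSYCHOMETRIC_ITEMS:
        raise ValueError("expected nine psychometric ratings")
    if len(pragmatic) != N_PRAGMATIC_ITEMS:
        raise ValueError("expected five pragmatic ratings")
    psychometric_total = sum(psychometric)
    pragmatic_total = sum(pragmatic)
    return psychometric_total, pragmatic_total, psychometric_total + pragmatic_total


overall_range = possible_range(N_PSYCHOMETRIC_ITEMS + N_PRAGMATIC_ITEMS)  # (-14, 56)
psychometric_range = possible_range(N_PSYCHOMETRIC_ITEMS)                 # (-9, 36)
pragmatic_range = possible_range(N_PRAGMATIC_ITEMS)                       # (-5, 20)

# Hypothetical item ratings for an unnamed measure.
totals = papers_totals(
    psychometric=[3, 2, 0, 1, 4, 0, 2, 0, -1],
    pragmatic=[4, 3, 2, 3, 1],
)
```

The maxima of 36 and 20 correspond to the "out of a possible 36" and "out of a possible 20" denominators used when the psychometric and pragmatic subtotals are reported.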

Search results
A total of 32,782 scientific articles were identified from the database search, from which 402 full texts were screened and 37 were included in the final review. An additional 186 relevant articles were identified from the grey literature search, resulting in 223 articles included in this review, representing 28 individual measures. See Additional file 2: Figure S1, for a summary of the article selection, and Additional file 2: Table S2 for a summary of exclusion reasons for measures included in previous reviews and repositories. Table 1 describes the characteristics of the included measures. Two measures assessed sustainability as an outcome, 25 assessed sustainability determinants, and one explicitly assessed both. Four measures were designed to assess different constructs other than those more directly related to sustainability or sustainability determinants. Twenty measures were based on a theory or framework, and 20 (of the 28 measures) included input from the target population during the development stage. Seventeen measures were developed or psychometrically evaluated in the USA, four in Australia, two in the Netherlands, and one each in Sweden and UK. Three measures were developed and/or psychometrically evaluated in more than one country. All 28 measures were available in English, while only five measures were also available in a language other than English.

Overview of identified measures
In relation to the scope of the identified measures, 11 were general measures designed to assess sustainability as an outcome or sustainability determinants in relation to any type of EBI within any setting. Four were general in terms of the target EBI but were restricted to a particular setting (e.g., clinical, public health, school). Seven could be used within any setting but were designed for a specific EBI or category of EBIs (e.g., health promotion programs, community-based programs, chronic disease prevention programs). Three were designed for a specific type of EBI or category of EBIs within a specific setting (e.g., depression care within a clinical/health care setting). Three were developed for assessing determinants of sustainability for the same specific EBI, the school-wide positive behavioral interventions and supports programs, which is delivered within the school setting.
Twenty measures were designed to be completed by both executive (e.g., supervisors, directors, administrators) and frontline staff (i.e., staff responsible for the day-to-day delivery of the EBI). Three measures were designed to be completed by executive staff only, and two by frontline staff only. Three were completed by researchers or purveyors responsible for monitoring or supporting the implementation of an EBI. Table 2 describes the constructs covered by measures of sustainability according to Moore's definition [7]. All three measures that assessed sustainability as an outcome covered the continued delivery of the EBI, while both the Provider Report of Sustainment Scale (PRESS) measure and the sustainment sub-scale from the SMSS incorporated aspects of behavior change. Only one measure incorporated concepts of time, evolution/adaptation, and continued benefits. None of the three measures incorporated all five main concepts related to sustainability as an outcome. Table 3 describes the constructs covered by the 26 measures of sustainability determinants according to the higher-order domains of the Integrated Sustainability Framework [2]. Ten measures covered aspects of all five higher-level domains. However, no measure covered all constructs that define the five higher-level, multi-level domains (see Additional file 2: Tables S3 to S7). "Inner context factors" was the most frequently covered domain with all but two measures (n=25) covering aspects of this domain. This was followed by the domains of "intervention characteristics" (n=23), "outer context" (n=18), and "processes" and "implementer and population characteristics" (n=17 measures each). When assessing the lower-level constructs that define the five higher-level domains of the Integrated Sustainability Framework, the "inner context factors" and "outer context factors" domains were the most broadly covered (Additional file 2: Tables S3 and S4).
Conversely, the "interventionist and population" domain and "characteristics of the intervention" were the most sparsely covered domains with only one and no measures, respectively, assessing all aspects of these domains (Additional file 2: Table S6 and S7). Table 1 details the overall PAPERS score for each measure, which was calculated by summing the ratings obtained from the individual items assessing the psychometric qualities (Table 4) together with the ratings for the individual items assessing the pragmatic qualities (Table 5). The PRESS measure, which measures sustainability as an outcome, was the highest-rated measure overall, with a total score of 35. Of the measures of sustainability determinants, the School-wide Universal Behavior Sustainability Index-School Teams (SUBSIST) measure obtained the highest PAPERS score with 29, followed by the Clinical Sustainability Assessment Tool (CSAT) and Sustainment Measurement System Scale (SMSS) each with a score of 28. Specifically, the SUBSIST had a higher overall score due to a larger number of psychometric properties being assessed compared to the CSAT and SMSS. Table 4 details the median score for the psychometric quality indicators from the PAPERS scale for each measure. Overall, PRESS was rated the highest in psychometric quality with a score of 18 out of a possible 36, followed by the SUBSIST measure with a score of 14. At an individual psychometric property level, internal consistency was the most frequently assessed (84%, n=26), with median scores ranging from 1 (minimal/emerging) to 4 (excellent). The second most frequently assessed psychometric property was structural validity (61%, n=19; median range: −1 to 4), followed by norms (55%, n=17; median range: −1 to 4). Few measures were assessed for responsiveness (n=1) or predictive validity (n=1). Additional file 2: Figure S2 provides a head-to-head comparison of the psychometric ratings of included measures.
were rated the highest in pragmatic quality, with each of these measures scoring 18 out of a possible 20. All three of these measures assessed determinants of sustainability. Of the three measures of sustainability as an outcome, the PRESS measure scored the highest with a total score of 17. All pragmatic items were scored for all measures, with most of the information obtained from grey literature sources, such as websites or publicly available scoring manuals. In terms of individual items, cost was the most highly rated, with all measures scoring excellent (score of 4), as they were freely available either publicly from a website, within a published manuscript, or via contact with the authors. The most poorly scored pragmatic quality was "ease of interpretation," with only two measures scoring the highest rating of excellent and 17 scoring minimal/emerging (score of 1). Additional file 2: Figure S3 provides a comparison of the pragmatic ratings of included measures. Table 6 describes how each of the identified measures has been used in empirical research to date. Eleven measures have yet to be used in an empirical study, six of which have only been published since 2020. The most frequently used measure of sustainability as an outcome was the Stages of Implementation Completion (SIC) measure, which has been used in 27 studies. For measures of determinants of sustainability, the most frequently used was the Change Process Capability Questionnaire (CPCQ) (n=34), followed by the Normalisation Measure Development questionnaire (NoMAD) (n=29) and Program Sustainability Assessment Tool (PSAT) (n=20). Geographically, the NoMAD was the most widely used, across 15 countries. All other measures have been used in six or fewer countries. Of the 16 measures that have been used in empirical research, six were used to assess constructs other than sustainability determinants or sustainability as an outcome.
Eleven measures were adapted prior to their use, despite only two measures (SIC and NoMAD) having been explicitly designed for adaptation in primary research. The most common adaptations included: removing items, adding items, changing the wording of items, changing the response scale, and deleting domains.

Discussion
We identified a growing number of measures relating to sustainability determinants, and, to a lesser extent, measures of sustainability as an outcome. Despite this increase, we found that the included measures had limited coverage of the key constructs of sustainability and were of variable quality, and only a small number were consistently used in empirical studies. This review identifies areas where future research is warranted, to ensure improvements in this field while minimising research waste. It also provides important information that end-users can use to help compare and select the most appropriate measure for their setting.

General considerations across all identified measures
Most of the measures identified were developed and/or psychometrically evaluated in the USA (20 out of 28), limiting their cross-cultural validity. This may also limit content coverage of constructs, as the outer context (related to broader policy and social context) has been identified as an important determinant of sustainability [2]. Only five of the 28 measures are available in languages other than English, of which only one, the NoMAD, has been translated and psychometrically evaluated in several languages. Translation and validation of measures is an extensive and costly process that requires specialised expertise [80]. This is a major limitation of the field and has implications for equity, as it highlights the inadequate access that non-English speaking populations and countries have to rigorous and standardised measures relating to sustainability. Without this access, researchers often create their own measures or, alternatively, translate and adapt existing measures without proper validation. Creating or leveraging existing research consortia that share resources across groups may help avoid this. Only 11 (two for sustainability as an outcome and nine for sustainability determinants) of the 28 identified measures were designed for general use (see Table 1). Fortunately, simple changes to the referent in a measure (e.g., changing the referenced EBI) should not alter the psychometric properties. In at least five measures [36,37,41,59,61], the items appeared to have content specific to the EBI and/or setting (beyond simple referent values) that would require extensive adaptation that may warrant new psychometric evidence. The advantages of generalised measures are the ability to standardise research, allowing for replication and comparability across studies, while reducing research waste due to use of one-off measures.
The need for more generalised measures is emphasised by our finding that most measures were adapted before use in empirical studies in ways that might compromise their psychometric evidence. However, it can be difficult to ensure that generalised measures are sensitive and informative, as the issues affecting sustainability can vary and depend on the setting and EBI under investigation [2]. Item banks, informed by item response theory, strike a balance between generalisability and specificity of a measure. The resulting standardised measures include survey items tailored to specific characteristics, such as settings, populations, and/or EBIs, which have been calibrated to create standardised scores that are comparable across the tailored items [23]. The use of item banks for measures within implementation science is not a new concept and has been suggested by other reviews of implementation measures [23]. Despite such calls, few efforts have been launched to create item banks for implementation science, which may be a focus for research consortia in the future.
The majority of the included measures (n=20) were designed to be completed by both the executive/management staff, who oversee the implementation of an EBI, and frontline staff, responsible for the day-to-day delivery of an EBI (see Table 1). In most instances, both executive and frontline staff are required to report on all items, regardless of their role in EBI delivery. Only the SIC, Sustainable Implementation Scale (SIS) and SUBSIST scales seem to distinguish issues between these two roles with separate questions for the different types of staff. The issues impacting on sustainability exist at varying levels within organisations [2,8,59]. Therefore, staff at different levels may have limited understanding of some determinants of sustainability or aspects of sustainability. For example, frontline staff may not be aware of budgetary constraints that administrators manage. Conversely, management may not possess the same level of day-to-day EBI implementation knowledge as frontline staff. If participants cannot accurately respond to a measure's items, the usefulness of the data collected is compromised. Different scales, or at least items within a scale, may need to be completed by different types of staff to ensure that the full range of issues impacting sustainability are accurately captured.

Measures of sustainability as an outcome
Of the 28 included measures, only three were classified as measuring sustainability as an outcome. This may reflect the difficulty of using self-report, standardised scales to validly capture the continued delivery and benefit of specific EBIs. Instead, it may be more appropriate to measure sustainability via other means, such as using a measure that asks directly about the continued delivery of the EBI or via observation. For instance, the SIC measure is an objective measure of the implementation process that records the timing and continued delivery of the main components of an EBI. It is currently focused predominantly on measuring the earlier phases of implementation, but it is being extended to comprehensively cover the sustainability phase following implementation [81]. Following such extensions and their rigorous psychometric evaluation, the SIC will make for an appealing comprehensive measure of the implementation process, including the sustainability phase. However, in some instances (e.g., where resources and time may be limited), the SIC may not be appropriate, as it is more complicated to administer, requiring specific training, input from multiple data sources, and completion by researchers and purveyors over an extended period of time. Alternatively, a general standardised measure such as the PRESS, which scored the highest of all measures on the PAPERS criteria, may be suitable in instances where direct measurement of EBI delivery cannot be obtained. Importantly, despite its high relative rating, the PRESS still lacks evidence of important psychometric properties, including predictive validity, concurrent validity, and responsiveness. Furthermore, none of the three measures of sustainability covered all five domains of Moore et al.'s [7] definition.
This is likely because most of the measures assess more specific constructs or aspects of sustainability, rather than the broader definition of sustainability used by Moore et al. For instance, sustainment has been recognised as a distinct concept, defined as the ongoing delivery of an evidence-based intervention [2,8,11,32], and was the focus of some of the measures included in this review, including the PRESS [32]. As we were attempting to provide a comprehensive review of all quantitative measures related to sustainability, we took a broad definition and included any measures related to sustainability. When developing and selecting measures for use, it is essential to clearly define the target construct and to select a measure that clearly aligns with that construct.

Measures of determinants of sustainability
Compared to measures of sustainability as an outcome, we identified a large number of measures that aligned with our definition of determinants of sustainability, with 26 (out of the total 28) measures identified. Eight of the 28 measures have been published since 2020, highlighting a recent increase in measure development, but several limitations exist. In terms of content validity, only 10 covered all five higher-level domains of the Integrated Sustainability Framework (see Table 3). While some of the measures (e.g., Sustainment Leadership Scale) were designed to cover only specific domains of determinants, the trade-off is the lack of a comprehensive assessment of sustainability. Few measures comprehensively covered all aspects of the "outer contextual factors" domain, which is a critical domain warranting multiple perspectives.
In terms of psychometric and pragmatic qualities, these measures varied substantially, with PAPERS ratings ranging from as low as 15 to as high as 29 out of a possible score of 56. For psychometric properties, the largest gaps relate to discriminant validity, predictive validity, and responsiveness, highlighting opportunities for future research. For the pragmatic criteria, all measures rated well on the items of cost and language. However, ease of interpretation was rated as minimal/emerging for all but ten of the sustainability determinants measures (see Table 5). Very few provided explicit instructions on how to score and interpret the measure. In fact, only two measures, the "National Health Service (NHS) Sustainability Model and Guide" [47] and the "Office of Adolescent Health (OAH) Sustainability Assessment" [56], provided explicit and detailed cutoff values and labels to enable classification of those at greater risk of not sustaining delivery of an EBI. However, neither of these two measures has undergone comprehensive psychometric evaluation, and thus the validity of these cut-points has not yet been examined.

Recommendations for use of current measures
Based on the evidence presented in this review, there are limitations to all identified measures of sustainability and determinants of sustainability. However, we recommend the following.
• If objective measures of sustainability are not available or feasible, the PRESS measure should be considered as a measure of sustainability as an outcome, as it is the most psychometrically robust and pragmatic to date. Future research should strive to establish evidence of predictive validity and responsiveness for the PRESS measure to further enhance its psychometric properties.
• For measures of determinants of sustainability, the SUBSIST had the highest PAPERS score, at 29. If evaluating school-wide positive behavioral interventions and supports, the SUBSIST should be considered as a measure of sustainability determinants for this EBI. However, it is not appropriate when considering other EBIs.
• In the context of other EBIs, the CSAT and SMSS both had an overall PAPERS rating of 28, illustrating favourable psychometric and pragmatic qualities compared to other measures of sustainability determinants. It is recommended that the CSAT be considered for use when assessing sustainability determinants in a clinical setting and the SMSS for other settings.
• In general, researchers wishing to use measures to assess the determinants of sustainability should carefully assess the psychometric and pragmatic qualities of each measure, as well as the specific characteristics that the measure was designed to assess. The information provided in the tables within this paper should assist end-users to select the most robust and suitable measure for their context.
• Furthermore, when selecting a measure for use, the specific construct to be measured should be carefully considered and a measure selected that aligns with the construct of interest.

Limitations
There are limitations that should be considered when interpreting these results. First, we only included measures that were explicitly stated to be designed for broad, standardised use. This decision was made to avoid inclusion of one-off study-specific measures. This process may have missed some relevant measures that could potentially be used elsewhere. Second, we only included quantitative measures, as we were interested in reflective measures that offered an efficient and comprehensive means of measuring and tracking sustainability as an outcome and sustainability determinants. This decision resulted in the exclusion of several sustainability-related tools that can be used to help support the planning and assessment of sustainability (e.g., RE-AIM and the extension of RE-AIM focused on sustainability [82,83], Long-Term Success Tool [84]). While these tools are useful in planning for, or tracking aspects of, sustainability, they are not designed solely for quantitative measurement and thus were beyond the scope of this review. These exclusions also highlight the difficulties that can be faced by researchers and practitioners when attempting to select an appropriate, rigorous, and standardised quantitative measure of these concepts. Third, we classified a measure as covering a particular construct of interest if it included at least one item relating to that construct. This is in contrast to other reviews that have used a criterion of at least two items [23,76]. We used a more liberal approach to ensure that we did not underestimate the content coverage of current measures, as we were mostly interested in assessing whether measures incorporated any aspect, even to a small extent, of the specific constructs we were focused on. This may have overestimated the content validity of identified measures, as it is usually insufficient to adequately cover an entire construct with only one item.
Fourth, we only searched the reference lists of relevant reviews and not of all eligible articles, which was a deviation from our original registered protocol. This deviation was due to the extensive volume of articles screened and identified. However, given the extensive search strategy employed, including published and grey literature, reference lists of previous reviews, snowball searching, and searching of online repositories of implementation measures, it is unlikely that this deviation significantly affected our search results for eligible measures. Finally, we only evaluated the psychometric properties of measures using studies with data that were explicitly analysed for psychometric evaluation. Studies with data analysed for other purposes, and not with the aim of assessing the psychometric properties of the measure (for example, an empirical study assessing the association between the measure and another construct but without the a priori aim of assessing the measure's validity), were not considered when scoring that measure's psychometric properties. This approach was considered the most appropriate, as psychometric evaluations should be pre-specified, and was also the most manageable and conservative approach for a review of this size.

Conclusion
This systematic review identified and evaluated the psychometric and pragmatic properties of standardised measures of sustainability as an outcome and sustainability determinants for use across community, public health, and clinical settings. It provides a comprehensive guide that researchers and stakeholders can use to select the most psychometrically robust, pragmatic, and relevant measure of sustainability and/or sustainability determinants available for their setting. It also highlights where future research is needed to improve the psychometric and pragmatic quality of the current measures in this field.